Basic Classification and Regression Trees (CART) Algorithm:
Start with all observations in one group.
Find the variable/split that best separates the response variable (successive binary partitions based on the different predictors / explanatory variables).
Evaluation “homogeneity” within each group
Divide the data into two groups (“leaves”) on that split (“node”).
Within each split, find the best variable/split that separates the outcomes.
Continue until the groups are too small or sufficiently “pure”.
Prune tree.
Minimize heterogeneity
For every observation that falls into the region \(R_m\),
prediction = the mean of the response values for observations in \(R_m\).
\(\Rightarrow\) Minimize Residual Sum of Squares (RSS): \[RSS = \sum_{m=1}^{|T|} \sum_{i \in R_m} (y_i - \overline{y}_{R_m})^2\] where \(\overline{y}_{R_m}\) is the mean response for observations within the \(m\)th region.
Recursive binary splitting
Select the predictor \(X_j\) and the cutpoint \(s\) such that splitting the predictor space into the regions \(\{X | X_j< s\}\) and \(\{X | X_j \geq s\}\) lead to the greatest reduction in RSS.
For any \(j\) and \(s\), define the pair of half-planes to be \[R_1(j,s) = \{X | X_j < s\} \mbox{ and } R_2(j,s) = \{X | X_j \geq s\}\] Find the value of \(j\) and \(s\) that minimize the equation: \[\sum_{i:x_i \in R_1(j,s)} (y_i - \overline{y}_{R_1})^2 + \sum_{i:x_i \in R_2(j,s)} (y_i - \overline{y}_{R_2})^2\]
where \(\overline{y}_{R_1}\) is the mean response for observations in \(R_1(j,s)\) and \(\overline{y}_{R_2}\) is the mean response observations in \(R_2(j,s)\).
Trees in action
Measures of impurity
\(\hat{p}_{mk}\) = proportion of observations in the \(m\)th region from the \(k\)th class.
classification error rate = fraction of observations in the node & not in the most common class:
cross-entropy\[D_m = - \sum_{k=1}^K \hat{p}_{mk} \log \hat{p}_{mk}\] (Gini index & cross-entropy will both take on a value near zero if the \(\hat{p}_{mk}\) values are all near zero or all near one.)
Recursive binary splitting
For any \(j\) and \(s\), define the pair of half-planes to be \[R_1(j,s) = \{X | X_j < s\} \mbox{ and } R_2(j,s) = \{X | X_j \geq s\}\]
Seek the value of \(j\) and \(s\) that minimize the equation: \[\begin{align}
& \sum_{i:x_i \in R_1(j,s)} \sum_{k=1}^K \hat{p}_{{R_1}k}(1-\hat{p}_{{R_1}k}) + \sum_{i:x_i \in R_2(j,s)} \sum_{k=1}^K \hat{p}_{{R_2}k}(1-\hat{p}_{{R_2}k})\\
\\
\mbox{equivalently: } & \\
& n_{R_1} \sum_{k=1}^K \hat{p}_{{R_1}k}(1-\hat{p}_{{R_1}k}) + n_{R_2} \sum_{k=1}^K \hat{p}_{{R_2}k}(1-\hat{p}_{{R_2}k})\\
\end{align}\]
Stopping
We can always make the tree more “pure” by continuing the split.
Too many splits will overfit the model to the training data!
There is a cost to having a larger (more complex!) tree.
Define the cost complexity criterion, \(\alpha > 0:\)\[\begin{align}
\mbox{numerical: } C_\alpha(T) &= \sum_{m=1}^{|T|} \sum_{i \in R_m} (y_i - \overline{y}_{R_m})^2 + \alpha \cdot |T|\\
\mbox{categorical: } C_\alpha(T) &= \sum_{m=1}^{|T|} \sum_{i \in R_m} I(y_i \ne k(m)) + \alpha \cdot |T|
\end{align}\] where \(k(m)\) is the class with the majority of observations in node \(m\) and \(|T|\) is the number of terminal nodes in the tree.
\(\alpha\) small: If \(\alpha\) is set to be small, we are saying that the risk is more worrisome than the complexity and larger trees are favored because they reduce the risk.
\(\alpha\) large: If \(\alpha\) is set to be large, then the complexity of the tree is more worrisome and smaller trees are favored.
In practice
Consider \(\alpha\) increasing. As \(\alpha\) gets bigger, the “best” tree will be smaller.
The training error will be monotonically related to the size of the training tree.
However, the test error will not be monotonically related to the size of the training tree.
A note on \(\alpha\)
In the text (Introduction to Statistical Learning) and almost everywhere else you might look, the cost complexity is defined as in previous slides.
However, you might notice that in R the cost_complexity value is typically less than 1. The value of the function that is being minimized in R is the average of the squared errors and the missclassification rate. (Allows for \(\alpha\) to not depend on sample size.)
Use recursive binary splitting to grow a large tree on the training data, stopping only when each terminal node has fewer than some minimum number of observations.
Apply cost complexity pruning to the large tree in order to obtain a sequence of best subtrees, as a function of \(\alpha\).
Use \(V\)-fold cross-validation to choose \(\alpha\). Divide the training observations into \(V\) folds, then for each \(v=1, 2, \ldots, V\):
Repeat Steps 1 and 2 on all but the \(v\)th fold of the training data.
Evaluate the mean squared prediction error on the data in the left-out \(k\)th fold, as a function of \(\alpha\). For each value of \(\alpha\), average the prediction error (either misclassification or RSS), and pick \(\alpha\) to minimize the average error.
Return the subtree from Step 2 that corresponds to the chosen value of \(\alpha\).
Variance refers to the amount by which would change if we estimated it using a different training set. Generally, the closer the model fits the (training) data, the more variable it will be (it will be different for each data set!). A model with many many explanatory variables will often fit the data too closely.
Bias refers to the error that is introduced by approximating the “truth” by a model which is too simple. For example, we often use linear models to describe complex relationships, but it is unlikely that any real life situation actually has a true linear model. However, if the true relationship is close to linear, then the linear model will have a low bias.
Test and training error as a function of model complexity. Note that the error goes down monotonically only for the training data. Be careful not to overfit!! image credit: ISLR
Reflecting on Model Building
Image credit: https://www.tmwr.org/
Reflecting on Model Building
Image credit: https://www.tmwr.org/
Reflecting on Model Building
Image credit: https://www.tmwr.org/
Bagging preview
Why a forest instead of a tree?
Trees suffer from very high variance
Two datasets will provide extremely different trees
To reduce variability, fit many trees and aggregate the predictions.
No longer reporting the tree structure, now reporting the predicted values (with hopes that predictions from one dataset will be similar to predictions from another dataset).
Bagging
Bagging = Bootstrap Aggregating.
multiple models aggregate together will produce a smoother model fit and better balance between
bias in the fit
variance in the fit.
Bagging can be applied to any classifier to reduce variability.
Recall that the variance of the sample mean is variance of data / n. So we’ve seen the idea that averaging an outcome gives reduced variability.
with bagging, there is no need for cross-validation or a separate test set to get an unbiased estimate of the test set error. It is estimated internally for free!!!
Each tree is constructed using a different bootstrap sample
Put all of the cases left out in the construction of the \(b^{th}\) tree down the \(b^{th}\) tree to get a classification.
At the end of the run,
take \(j\) to be the class that got most of the votes every time case \(i\) was oob.
The proportion of times that \(j\) is not equal to the true class averaged over all cases is the oob error.
OOB Prediction
Let the OOB prediction for the \(i^{th}\) observation to be \(\hat{y}_{(-i)}\)
══ Workflow [trained] ══════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: rand_forest()
── Preprocessor ────────────────────────────────────────────────────────────────
3 Recipe Steps
• step_unknown()
• step_mutate()
• step_naomit()
── Model ───────────────────────────────────────────────────────────────────────
Call:
randomForest(x = maybe_data_frame(x), y = y, ntree = ~num_trees, mtry = min_cols(~7, x), oob.error = ~TRUE)
Type of random forest: regression
Number of trees: 500
No. of variables tried at each split: 7
Mean of squared residuals: 92247.19
% Var explained: 85.06
Fit again
══ Workflow [trained] ══════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: rand_forest()
── Preprocessor ────────────────────────────────────────────────────────────────
3 Recipe Steps
• step_unknown()
• step_mutate()
• step_naomit()
── Model ───────────────────────────────────────────────────────────────────────
Call:
randomForest(x = maybe_data_frame(x), y = y, ntree = ~num_trees, mtry = min_cols(~7, x), oob.error = ~TRUE)
Type of random forest: regression
Number of trees: 500
No. of variables tried at each split: 7
Mean of squared residuals: 92405.61
% Var explained: 85.03